Graded task Logistic regression
Your task is to create a logistic regression model and to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The data is given in the sheet Heart_desease. The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.
Requirements: 1. Create a logistic regression model and estimate predictions. 2. Interpret models’ coefficients and Odd ratios. 3. Assess the model with performance metrics.”
Evaluation criteria: “1. How well the model creation in the task description is followed. 2. You can explain how you selected the best model. 3. You can interpret the models’ performance metrics and parameters. 4. You present fluent and clearly. Presentation should take 10-15 minutes. 5. Analytical approach to the problem”
Sample questions: “1. How have you selected your best model? 2. Why have you used the particular model’s performance metrics? 3. What is Odds ratio? 4. How do you interpret regression model’s intercept? 5. How do you interpret regression model’s coefficients?”
Medical( history)
Predict variable (desired target) - 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”)
Abbreviations: - EDA : Exploratory Data Analysis - EV : Explanatory Variable (i.e. predictor)
library(tidyverse)
library(janitor)
library(emmeans)
library(dlookr)
library(flextable)
library(jtools)
library(GGally)
library(vcd)
library(ggstatsplot)
library(gridExtra)
library(caret)
library(jtools)
library(kableExtra)
select <- dplyr::select
bd <- "/Users/leonardo/My Drive/turing_college/Modules/Linear_and_Logistic_Regression/"
df <- read_csv(paste0(bd,"Graded_task_Heart_disease_data.csv")) %>%
tibble %>%
janitor::clean_names()
# change appropriate variables to factor
df <- df %>% mutate_at(
c("male","education","current_smoker","bp_meds",
"prevalent_stroke","prevalent_hyp",
"diabetes","ten_year_chd"), factor
)
# df %>% glimpse()
# dlookr::diagnose(df) %>% flextable
To estimate the perfomance of the model with out-of-sample data, we split the whole dataset in a train (60% - 2543 data points) and test (40% - 1695 data points) set.
set.seed(124)
N = nrow(df)
split_index <- sample(c("train","test"), N, replace = T, prob = c(0.6,0.4))
train <- df[which(split_index == "train"),] # train set
test <- df[which(split_index == "test"),] # test set
We will carry out the check for outliers - and subsequently evaluate potential imputation for NA - only on the train data, and blindly apply this to the test. In a real world situation, the test set can grow continously, therefore we need to take decisions only on the train data.
The evaluation for potential NA imputation will be carried out later, just before modelling. The reason for this is that carrying out NA imputation first would bias the EDA.
Observations All the numerical values are in the norm, except for blood pressure. Specifically, the systolic blood pressure sys_bp is way too high, with about 4% of the participants having a bp > 180 (requiring immediate medical care).
However, in published papers on the Framinghton studies these values are accepted as plausible. We will therefore only exclude values above 250 mmHg.
# train %>% select_if(is.numeric) %>% summary()
# boxplot(glucose ~ diabetes, data = train, main = "glucose ~ diabetes")
boxplot(sys_bp ~ bp_meds, data = train, main = "systolic blood pressure ~ bp_meds")
boxplot(sys_bp ~ prevalent_hyp, data = train, main = "systolic blood pressure ~ hypertension")
# percent of people with sys_bp > 180
pct_high_sys_bp <- round((train$sys_bp > 180) %>% sum() / nrow(train) * 100,2)
# xtabs(~ prevalent_hyp + bp_meds, data = train)
train <- train %>% filter(sys_bp <= 250)
test <- test %>% filter(sys_bp <= 250)
Initially, we carry out t-tests to discover which EVs exhibit (corrected) significant differences between people with and without risk of CHD.
Then we will also examine the potential correlations between variables. If the latter are substantial, we will evaluate the possibility of discarding some EVs due to collinearity.
(an excercise in fitting many models using purrr::map)
library(tidyr)
ttests <- train %>%
# select("ten_year_chd", "age", "glucose") %>%
select(-c("education","male","current_smoker","bp_meds",
"prevalent_stroke","prevalent_hyp", "diabetes")) %>%
na.omit() %>%
# pivot longer to prepare the nesting
pivot_longer(
cols = -ten_year_chd,
names_to = "variable", values_to = "value"
) %>%
group_by(variable) %>%
group_nest() %>%
# fit a model to each nested df
mutate(
ttmod = map(data, ~ lm(value ~ ten_year_chd, data = .x))
) %>%
# get the stats
mutate(
tidy = map(ttmod, broom::tidy)
) %>%
unnest(tidy) %>%
filter(term == "ten_year_chd1") %>%
# correct for multiple comparisons
mutate(adj_pval = p.adjust(p.value, method = "fdr")) %>%
# filter(adj_pval <= 0.01) %>%
relocate(adj_pval, .after=variable) %>%
arrange(adj_pval)
ttests %>%
select(variable, statistic, adj_pval) %>%
rename(`t value` = statistic) %>%
flextable() %>%
set_formatter(adj_pval = function(x) sprintf("%.2e", x))
variable | t value | adj_pval |
|---|---|---|
age | 12.0815563 | 1.01e-31 |
sys_bp | 11.9732840 | 1.73e-31 |
dia_bp | 8.2215602 | 8.85e-16 |
glucose | 6.5653995 | 1.28e-10 |
bmi | 5.6308260 | 3.22e-08 |
tot_chol | 4.0326391 | 7.59e-05 |
cigs_per_day | 2.1141771 | 3.96e-02 |
heart_rate | 0.8512159 | 3.95e-01 |
ggbetweenstats(
data = train,
x = ten_year_chd,
y = age
)
ggbetweenstats(
data = train,
x = ten_year_chd,
y = sys_bp
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = dia_bp
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = dia_bp
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = bmi
)
#
ggbetweenstats(
data = train,
x = ten_year_chd,
y = tot_chol
)
Exploring the correlation matrix of the numerical variables to detect potential collinearity
train %>%
# select(-c("education","male","current_smoker","bp_meds",
# "prevalent_stroke","prevalent_hyp", "diabetes")) %>%
select(ten_year_chd, age, sys_bp, dia_bp, bmi, glucose, tot_chol) %>%
na.omit() %>%
# mutate_at(vars(-ten_year_chd), ~ log(. + 1)) %>%
GGally::ggpairs(
aes(color = ten_year_chd),
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = 0.7, size = 1)),
diag = list(continuous = wrap("densityDiag", alpha = 0.5))
)
age appears to be the most discriminant variable for ten year CHD prediction. Other EVs which are significantly different (after correction with FDR) in people with and without CHD risk are bmi, sys_bp, dia_bp, glucose, tot_chol.
the systolic sys_bp and diastolic dia_bp blood pressure are correlated with each other - as expected. Since CHD is especially related to hypertension, we will only use systolic blood pressure
bmi and age are also correlated with blood pressure, however they appear to be important variables for CHD and their correlation with blood pressure is mild, so we will keep them
NB: the number of cigarettes per day (cigs_per_day) will be explored as a categorical variable (see below) with levels for tens of cigarettes per day
# train %>% select_if(is.factor) %>% summary
As before for the numerical variables, we will first assess a dependence between risk of CHD and other categorical variables, and then explore the reciprocal relationships of all categorical variables to assess the presence of collinearity.
library(ggstatsplot)
ggbarstats(
data = train %>% select(ten_year_chd, male),
x = male,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, prevalent_hyp),
x = prevalent_hyp,
y = ten_year_chd,
label = "both"
)
The values are too sparse to treat this as a continous variable. We will therefore group the # cigs per day in tens and treat this as a categorical variable.
A first visual exploration shows that there appear to be mild differences in the proportion of people smoking different amount of cigarettes per day.
When carrying out a chi-square test of independence, we are warned that the results might be incorrect. This appears to be due to the fact that there are too few observations in the category for 50 and 60 cigarettes per day. When these levels are removed, a significant (p = 0.0066) dependence is shown between risk group and amount of cigarettes per day.
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)
train %>%
select(ten_year_chd, cigs_per_day) %>%
na.omit() %>%
mutate(cigs_factor = round(cigs_per_day/10,0)) %>%
group_by(ten_year_chd, cigs_factor) %>%
summarise(count = n(), .group = "drop") %>%
group_by(ten_year_chd) %>%
mutate(prop = count / sum(count)) %>%
select(ten_year_chd, cigs_factor, prop) %>%
ggplot(aes(x = factor(cigs_factor), y = prop, fill = ten_year_chd)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Tens of Cigarettes per Day", y = "Proportion in each group", fill = "CHD Status")
train %>%
select(ten_year_chd, cigs_per_day) %>%
mutate(tens_cigarettes = round(cigs_per_day/10, 0) ) %>%
filter(tens_cigarettes < 5) %>%
select(-cigs_per_day) %>%
na.omit() %>%
xtabs(~ ten_year_chd + tens_cigarettes, data = .) %>%
chisq.test() %>% tidy() %>% kable(format = "html") %>%
add_header_above(c("Chi-Square CHD Risk ~ Tens_of_cigarettes" = 4)) %>%
kable_styling()
| statistic | p.value | parameter | method |
|---|---|---|---|
| 14.22453 | 0.0066119 | 4 | Pearson’s Chi-squared test |
ggbarstats(
data = train %>% select(ten_year_chd, current_smoker),
x = current_smoker,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, bp_meds),
x = bp_meds,
y = ten_year_chd,
label = "both"
)
ggbarstats(
data = train %>% select(ten_year_chd, education),
x = education,
y = ten_year_chd,
label = "both"
)
Graphically display the association between binary categorical variables, including ten_year_chd to assess the presence of correlated predictors
library(vcd)
train_cat <- train %>% select_if(is.factor) %>% select(-c("ten_year_chd","education"))
pairs(
table(train_cat),
diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5), #move down variable names
upper_panel_args = list(shade = TRUE), #set shade colors
lower_panel_args = list(shade = TRUE)
)
Numerical variables Age (age), High blood pressure (sys_bp), body-mass index (bmi), glucose blood level (glucose), cholesterol blood level (tot_chol) appear to be involved in predicting the risk of CHD. As noted above, cigs_per_day was treated as a categorical variables with levels for o and multiples of 10.
Categorical variables Sex (male), hypertension (prevalent_hyp), blood pressure medications (bp_meds), education (education) and number of cigarettes per day (cigs_per_day) appear to be involved in predicting the risk of CHD.
There are widespread associations between numerical and categorical variables. This is not surprising for two reasons:
sys_bp, prevalence_hyp, bp_meds)age, the bmi increases, and both are likely to lead to increased blood pressure sys_bp, which in turn can lead to hypertension and relative medicamentsDespite this, we will enter all the selected variables in the model and assess the presence of collinearity using VIF.
library(gridExtra)
boxplot_tt <- function(response, predictor) {
tt <- t.test(train[[response]] ~ train[[predictor]], data =train)
res <- paste0("t = ", round(tt$statistic,2), " p = ", round(tt$p.value,4))
p <- train %>% na.omit() %>%
ggplot(aes(x = .data[[predictor]], y = .data[[response]])) +
geom_boxplot() +
labs(title = paste0(predictor," vs ", response),
subtitle = res)
return(p)
}
p1 <- boxplot_tt("age", "male")
p2 <- boxplot_tt("age", "prevalent_hyp")
p3 <- boxplot_tt("age", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("bmi", "male")
p2 <- boxplot_tt("bmi", "prevalent_hyp")
p3 <- boxplot_tt("bmi", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("sys_bp", "male")
p2 <- boxplot_tt("sys_bp", "prevalent_hyp")
p3 <- boxplot_tt("sys_bp", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("glucose", "male")
p2 <- boxplot_tt("glucose", "prevalent_hyp")
p3 <- boxplot_tt("glucose", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
p1 <- boxplot_tt("tot_chol", "male")
p2 <- boxplot_tt("tot_chol", "prevalent_hyp")
p3 <- boxplot_tt("tot_chol", "bp_meds")
grid.arrange(p1, p2, p3, nrow = 1)
As a last step before fitting the models, we need to take care of NAs in these 6 variables. As mentioned in the beginning, we do this here in order not to bias the EDA.
In case of imputation, we will apply the same procedure blindly to the test set.
bmi is significantly different for males and females. There are only 12 values missing. We will impute the NAs with the median of each gender in either risk or no-risk group.
glucose and tot_chol are significantly different especially in people with and without hypertension. We will impute the NAs with the median of each gender in either risk or no-risk group.
bp_meds is a very important variable, and not easy to impute, so we don’t want to run risks here. We see from the table below that the people where blood pressure (bp) medication was not recorded have (median) bp values compatible with people who have either high or low blood pressure AND do not take medications. This could justify the imputation. At the same time, this is a very important variable, associated with drug use. We don’t want to run this risk. We will therefore discard people where it is not known whether they take medications.
cigs_per_day is a variable that can hardly be imputed correctly, therefore we will remove the data points with NA in this variable (less than 1%)
# bmi ~ male
ggbetweenstats(
data = train,
x = male,
y = bmi
)
# glucose ~ hypertension
ggbetweenstats(
data = train,
x = prevalent_hyp,
y = glucose
)
# cholesterol ~ hypertension
ggbetweenstats(
data = train,
x = prevalent_hyp,
y = tot_chol
)
# bp_meds ~ prevalent_hyp
df %>%
select(sys_bp, prevalent_hyp, bp_meds) %>%
group_by(bp_meds, prevalent_hyp) %>%
summarise(
median_sys_bp = median(sys_bp),
count_prevalent_hyp = n()
)
## # A tibble: 5 × 4
## # Groups: bp_meds [3]
## bp_meds prevalent_hyp median_sys_bp count_prevalent_hyp
## <fct> <fct> <dbl> <int>
## 1 0 0 122 2891
## 2 0 1 150. 1170
## 3 1 1 165 124
## 4 <NA> 0 121 31
## 5 <NA> 1 154. 22
NA imputation on train and test set
# ------------------------- train --------------------------------------------
train <- train %>%
# impute the bmi as the median of males/females
group_by(ten_year_chd, male) %>%
mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
ungroup() %>%
# impute glucose and cholesterol level as the median of prevalent_hyp
group_by(ten_year_chd, prevalent_hyp) %>%
mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
ungroup() %>%
# remove people where bp_meds is not unknown
filter(!is.na(bp_meds)) %>%
# remove people with no information about education level
filter(!is.na(education)) %>%
# # remove the mean from numerical variables
# mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>%
# remove residual NAs
na.omit()
train %>%
select(
ten_year_chd, age, sys_bp, bmi, glucose, tot_chol,
male, prevalent_hyp, bp_meds, education
) %>%
diagnose %>%
flextable
variables | types | missing_count | missing_percent | unique_count | unique_rate |
|---|---|---|---|---|---|
ten_year_chd | factor | 0 | 0 | 2 | 0.0008093889 |
age | numeric | 0 | 0 | 39 | 0.0157830838 |
sys_bp | numeric | 0 | 0 | 213 | 0.0861999191 |
bmi | numeric | 0 | 0 | 1,099 | 0.4447592068 |
glucose | numeric | 0 | 0 | 121 | 0.0489680291 |
tot_chol | numeric | 0 | 0 | 227 | 0.0918656414 |
male | factor | 0 | 0 | 2 | 0.0008093889 |
prevalent_hyp | factor | 0 | 0 | 2 | 0.0008093889 |
bp_meds | factor | 0 | 0 | 2 | 0.0008093889 |
education | factor | 0 | 0 | 4 | 0.0016187778 |
# ------------------------- test --------------------------------------------
test <- test %>%
# impute the bmi as the median of males/females
group_by(ten_year_chd, male) %>%
mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
ungroup() %>%
# impute glucose and cholesterol level as the median of prevalent_hyp
group_by(ten_year_chd, prevalent_hyp) %>%
mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
ungroup() %>%
# remove people where bp_meds is not unknown
filter(!is.na(bp_meds)) %>%
# remove people with no information about education level
filter(!is.na(education)) %>%
# # remove the mean from numerical variables
# mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>%
# remove residual NAs
na.omit()
test %>%
select(
ten_year_chd, age, sys_bp, bmi, glucose, tot_chol,
male, prevalent_hyp, bp_meds, education
) %>%
diagnose %>%
flextable
variables | types | missing_count | missing_percent | unique_count | unique_rate |
|---|---|---|---|---|---|
ten_year_chd | factor | 0 | 0 | 2 | 0.001265823 |
age | numeric | 0 | 0 | 37 | 0.023417722 |
sys_bp | numeric | 0 | 0 | 201 | 0.127215190 |
bmi | numeric | 0 | 0 | 859 | 0.543670886 |
glucose | numeric | 0 | 0 | 102 | 0.064556962 |
tot_chol | numeric | 0 | 0 | 213 | 0.134810127 |
male | factor | 0 | 0 | 2 | 0.001265823 |
prevalent_hyp | factor | 0 | 0 | 2 | 0.001265823 |
bp_meds | factor | 0 | 0 | 2 | 0.001265823 |
education | factor | 0 | 0 | 4 | 0.002531646 |
It appears that there are two interesting questions:
How well is the risk predicted by generic factors alone? (such as sex, age, and blood pressure)
How much can we increase the prediction if we include other variables more related to hypertension, such as diagnosed hypertension or assumption of blood pressure medications?
Since we allowed some correlated variables to enter the model, we will pay close attention to the VIF (variance inflation factor).
After the “best” model is detemined by manual stepwise regression, we will compare the former to an automatic stepwise regression carried out using the AIC criterion alone.
We start by fitting the “maximal” model, which includes all the provided EVs, to have a baseline estimate for the AUC of this model and of the impact of the correlation between variables we previously detected with EDA - for the latter, we will check the VIF.
Then we fit the full model, which includes all the variables we previously selected based on the EDA. This will show how well the hints gathered in the EDA are actually reflected in the significance of the coefficients. Specifically, compared to the “maximal” model.
We will then proceed to eliminate EVs based on whether the coefficients are not sigificant or have a very small effect (in terms of odds ratio). At each step we evaluate the AIC and AUC to detect whether reducing the EVs led to an increase or decrease of the performance.
Finally we will examine the confusion matrix and adjust the threshold for binary prediction in order to have maximum Sensitivity (more on this later).
The coefficients are reported in two forms (at the expense of verbosity):
original, non-standardized and non-exponentiated (using summary()): to allow calculation of the odds ratio (via exp()) of 10 year CHD given the actual values of the EVs for a specific person.
standardized and exponentiated (using jtools::summ()): to allow comparing the magnitude of the coefficients and to calculate the VIF.
The AIC was similar across all examined models (between 1887 and 1890), therefore this metric had a small weight in the choice of the final model. Similarly, all the models explained at most 17% of the total variance in the data, which is not a substantial amount.
The VIF was not critical, even in the maximal model, therefore the impact of correlated predictors was minimal or negligible.
The maximal model highlighted the contribution of gender, age, systolic blood pressure, and to a lesser extent the prevalence of hypertension and education. Importantly, the p values for some of these variables was higher than in simpler models, most likely due to collinearity with other variables. The AUC of this model was 0.703.
Our full model highlighted the same variables. In addition reducing the number of EVs increased the significance of most of the selected EVs (especially age and systolic blood pressure). The AUC of this model was 0.705.
We then removed EVs with small or no effect from the full model, specifically blood pressure medications, bmi, cholesterol and education. This led to an AUC of 0.711, and all variables significant.
Further removing the EVs specifying whether the patient had hypertension led to a simpler model and did not substantially change the AUC (0.713).
Remarkably, this model contained only gender, age, systolic blood pressure and glucose blood level, in addition to number of cigarettes smoked per day. That is, two common demographic variables, and two physiologic measurements which can be easily retrieved with blood pressure measurement and a simple blood exam
The increase in odds to develop a CHD in 10 years were as follows for each EV:
The intercept can be interpreted as the probability of a female being at risk when all predictors are set to 0. In our case we have predictors for age, glucose and blood pressure, which are obviously not zero, therefore this interpretation of the intercept is not really helpful in our case. For the sake of calculation: \(p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{exp(-8.643)}{1+exp(-8.643)} = 0.000176\)
It is at this point very interesting to notice that:
This is a very interesting result from a practical perspective, since it means that generic demographic and physiological variables are much more predictive of the risk of CHD in 10 years with respect to a diagnosis of hypertension, and that the assumption of blood pressure medicaments - presumably to normalize the blood pressure level in people with hypertension - does not reduce the risk of CHD
Our final model therefore includes:
Afterwards, we evaluated the performance of our model on the test data. Using a default probability of 0.5 for binary prediction leads to a poor Sensitivity, in that only ~ 7% of the people with a real 10 year CHD risk (\(\frac{16}{16+219}\)) are predicted as being at risk.
Since in our case the aim is to maximise the detection of people who are truly at risk - even at expense of inflating the number of false negatives - we lowered the threshold of binary prediction to 0.1. This allowed us to reach a sensitivity of 84%.
The automatic stepwise regression (stepAIC) identifies the same variables as our final model, plus a small (uncorrected significant) contribution of education level and prevalence of hypertension. However, this model has a smaller AUC (0.703) than our final model, and is also more complex. Therefore we retained our final model.
Finally, we hypothesized that the imbalance between Sensitivity and Specificity could be due to the high proportion of people not at risk in our dataset. Therefore, as a last test, we sampled an equal number of people with and without risk.
In this balanced samples model the AUC increased to 0.738. By using a threshold for binary decision of 0.3, this model correctly predicts 91.4% of the people with actual CHD risk, and a false negative rate of 70%. By adopting a more liberal threshold of 0.4, the model predicts 81.3% of the people with actual CHD, and about 50% of the people with no risk of CHD.
A final remark This model captures only a fraction of the total variance in the data (~ 15%). It is odd that other variables widely thought to be associated with cardiovascular diseases were not recorded, such as stress level (e.g. # hours of work per day/week), alcohol comsumption, quality of the air.
do_logistic_regression : a function to carry out the logistic regression and print out the summary of the model, the ROC curve - and the AUC - as well as the confusion matrix.
do_logistic_regression <- function(EVs, train_data = train, test_data = test, thresh = 0.5) {
# Assemble the formula
formula_string <- paste(EVs, collapse = " + ")
formula_obj <- as.formula(paste("ten_year_chd ~", formula_string))
fit <- glm(formula_obj, family = "binomial", data = train_data)
# # Standard R summary
summary(fit) %>% print()
# Print a summary
# summary(fit) %>% print()
if (length(EVs) > 1) {
jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
} else {
jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
}
# Estimated probability values from the logistic model
phat = predict(fit, newdata = test_data, type = "response")
# ROC curve using pROC::roc
# https://www.youtube.com/watch?v=qcvAqAH60Yw&t=493s
par(pty = "s")
pROC::roc(
test_data$ten_year_chd ~ phat, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE,
xlab="% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4
)
# Confusion matrix and derived quantities - using caret::confusionmatrix
sprintf("\n Using a threshold for prediction = %.2f \n\n", thresh) %>% cat()
thresh <- thresh
predicted <- ifelse(phat > thresh, 1, 0)
actual_predicted <- tibble(
actual = test_data$ten_year_chd,
predicted = ifelse(phat > thresh, 1, 0)
)
caret::confusionMatrix(
as.factor(predicted), # predicted
as.factor(test_data$ten_year_chd), # actual
positive = "1" # the positive class in our case is "1"
) %>% print()
return(phat)
}
Maximal model, including all the variables in the dataset. We use this only as a benchmark for the models instructed by EDA.
We observe high VIF values in sys_bp and dia_bp - as expected.
EVs <- train %>% select(-ten_year_chd) %>% colnames()
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8415 -0.5828 -0.4175 -0.2897 2.8226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4497344 0.8561698 -8.701 < 2e-16 ***
## male1 0.3800302 0.1329243 2.859 0.00425 **
## age 0.0608765 0.0083022 7.333 2.26e-13 ***
## education2 -0.3243521 0.1517706 -2.137 0.03259 *
## education3 -0.3327557 0.1860997 -1.788 0.07377 .
## education4 -0.0502180 0.1969542 -0.255 0.79874
## current_smoker1 0.1717930 0.1922979 0.893 0.37166
## cigs_per_day 0.0149836 0.0078998 1.897 0.05787 .
## bp_meds1 0.3755894 0.2932894 1.281 0.20033
## prevalent_stroke1 0.8294602 0.6158164 1.347 0.17800
## prevalent_hyp1 0.3669952 0.1660561 2.210 0.02710 *
## diabetes1 0.3126526 0.3897172 0.802 0.42241
## tot_chol 0.0004310 0.0014095 0.306 0.75980
## sys_bp 0.0121563 0.0046474 2.616 0.00890 **
## dia_bp 0.0004736 0.0078561 0.060 0.95193
## bmi 0.0141265 0.0151335 0.933 0.35058
## heart_rate -0.0068913 0.0050928 -1.353 0.17601
## glucose 0.0063839 0.0028683 2.226 0.02604 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1854.4 on 2453 degrees of freedom
## AIC: 1890.4
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(17) = 259.91, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1890.45, BIC = 1995.07
##
## Standard errors: MLE
## -------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ----------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -8.70 0.00
## male1 1.46 1.13 1.90 2.86 0.00 1.25
## age 1.06 1.05 1.08 7.33 0.00 1.32
## education2 0.72 0.54 0.97 -2.14 0.03 1.13
## education3 0.72 0.50 1.03 -1.79 0.07 1.13
## education4 0.95 0.65 1.40 -0.25 0.80 1.13
## current_smoker1 1.19 0.81 1.73 0.89 0.37 2.61
## cigs_per_day 1.02 1.00 1.03 1.90 0.06 2.74
## bp_meds1 1.46 0.82 2.59 1.28 0.20 1.11
## prevalent_stroke1 2.29 0.69 7.66 1.35 0.18 1.04
## prevalent_hyp1 1.44 1.04 2.00 2.21 0.03 1.94
## diabetes1 1.37 0.64 2.93 0.80 0.42 1.88
## tot_chol 1.00 1.00 1.00 0.31 0.76 1.07
## sys_bp 1.01 1.00 1.02 2.62 0.01 3.64
## dia_bp 1.00 0.99 1.02 0.06 0.95 2.92
## bmi 1.01 0.98 1.04 0.93 0.35 1.23
## heart_rate 0.99 0.98 1.00 -1.35 0.18 1.10
## glucose 1.01 1.00 1.01 2.23 0.03 1.89
## -------------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1329 213
## 1 16 22
##
## Accuracy : 0.8551
## 95% CI : (0.8367, 0.8721)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3513
##
## Kappa : 0.1249
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.09362
## Specificity : 0.98810
## Pos Pred Value : 0.57895
## Neg Pred Value : 0.86187
## Prevalence : 0.14873
## Detection Rate : 0.01392
## Detection Prevalence : 0.02405
## Balanced Accuracy : 0.54086
##
## 'Positive' Class : 1
##
This is the “full” model with all the variables chosen based on the EDA.
No problems with collinearity.
EVs <- c(
"male", "prevalent_hyp", "bp_meds", "education",
"age", "sys_bp", "bmi", "glucose", "tot_chol", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9398 -0.5882 -0.4186 -0.2876 2.8221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.8924235 0.7224553 -10.924 < 2e-16 ***
## male1 0.3996399 0.1306947 3.058 0.002230 **
## prevalent_hyp1 0.3564960 0.1628251 2.189 0.028565 *
## bp_meds1 0.4381388 0.2883448 1.519 0.128638
## education2 -0.3319065 0.1512202 -2.195 0.028174 *
## education3 -0.3409724 0.1858949 -1.834 0.066621 .
## education4 -0.0373354 0.1959238 -0.191 0.848870
## age 0.0613141 0.0080512 7.616 2.63e-14 ***
## sys_bp 0.0118211 0.0034909 3.386 0.000709 ***
## bmi 0.0137041 0.0146029 0.938 0.348011
## glucose 0.0075754 0.0021169 3.579 0.000345 ***
## tot_chol 0.0002287 0.0014069 0.163 0.870891
## cigs_per_day 0.0191629 0.0052321 3.663 0.000250 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1859.5 on 2458 degrees of freedom
## AIC: 1885.5
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(12) = 254.88, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1885.48, BIC = 1961.04
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -10.92 0.00
## male1 1.49 1.15 1.93 3.06 0.00 1.21
## prevalent_hyp1 1.43 1.04 1.97 2.19 0.03 1.87
## bp_meds1 1.55 0.88 2.73 1.52 0.13 1.09
## education2 0.72 0.53 0.97 -2.19 0.03 1.11
## education3 0.71 0.49 1.02 -1.83 0.07 1.11
## education4 0.96 0.66 1.41 -0.19 0.85 1.11
## age 1.06 1.05 1.08 7.62 0.00 1.25
## sys_bp 1.01 1.00 1.02 3.39 0.00 2.06
## bmi 1.01 0.99 1.04 0.94 0.35 1.16
## glucose 1.01 1.00 1.01 3.58 0.00 1.02
## tot_chol 1.00 1.00 1.00 0.16 0.87 1.06
## cigs_per_day 1.02 1.01 1.03 3.66 0.00 1.22
## ----------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1332 219
## 1 13 16
##
## Accuracy : 0.8532
## 95% CI : (0.8347, 0.8703)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.433
##
## Kappa : 0.0915
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06809
## Specificity : 0.99033
## Pos Pred Value : 0.55172
## Neg Pred Value : 0.85880
## Prevalence : 0.14873
## Detection Rate : 0.01013
## Detection Prevalence : 0.01835
## Balanced Accuracy : 0.52921
##
## 'Positive' Class : 1
##
Remove education, bp_meds, bmi, tot_chol since they have no effect on the model.
The AIC slightly decreases with respect to the full model.
# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs <- c(
"male", "prevalent_hyp",
"age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1088 -0.5889 -0.4241 -0.2951 2.7635
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.027822 0.562910 -14.261 < 2e-16 ***
## male1 0.434934 0.127776 3.404 0.000664 ***
## prevalent_hyp1 0.387263 0.160552 2.412 0.015862 *
## age 0.064891 0.007775 8.346 < 2e-16 ***
## sys_bp 0.013408 0.003393 3.952 7.76e-05 ***
## glucose 0.007572 0.002106 3.595 0.000324 ***
## cigs_per_day 0.018188 0.005197 3.500 0.000465 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1870.0 on 2464 degrees of freedom
## AIC: 1884
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(6) = 244.31, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.12
## AIC = 1884.05, BIC = 1924.73
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -14.26 0.00
## male1 1.54 1.20 1.98 3.40 0.00 1.16
## prevalent_hyp1 1.47 1.08 2.02 2.41 0.02 1.83
## age 1.07 1.05 1.08 8.35 0.00 1.17
## sys_bp 1.01 1.01 1.02 3.95 0.00 1.95
## glucose 1.01 1.00 1.01 3.60 0.00 1.01
## cigs_per_day 1.02 1.01 1.03 3.50 0.00 1.21
## ----------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1335 219
## 1 10 16
##
## Accuracy : 0.8551
## 95% CI : (0.8367, 0.8721)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3513
##
## Kappa : 0.0958
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06809
## Specificity : 0.99257
## Pos Pred Value : 0.61538
## Neg Pred Value : 0.85907
## Prevalence : 0.14873
## Detection Rate : 0.01013
## Detection Prevalence : 0.01646
## Balanced Accuracy : 0.53033
##
## 'Positive' Class : 1
##
Remove prevalent_hyp since it is only marginally significant
Note that the AIC only marginally increases (less than 0.2%) and this model is much simpler and requires only 4 variables, two of which are common demographic variables (age and sex), while the other two can be easily gathered through a pressure measurement and a blood test.
# Remove prevalent_hyp since it is only marginally significant
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1053 -0.5899 -0.4330 -0.2982 2.8162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.643579 0.504565 -17.131 < 2e-16 ***
## male1 0.449590 0.127385 3.529 0.000417 ***
## age 0.065845 0.007745 8.502 < 2e-16 ***
## sys_bp 0.018684 0.002597 7.194 6.29e-13 ***
## glucose 0.007547 0.002112 3.573 0.000353 ***
## cigs_per_day 0.017858 0.005184 3.445 0.000571 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1875.8 on 2465 degrees of freedom
## AIC: 1887.8
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -17.13 0.00
## male1 1.57 1.22 2.01 3.53 0.00 1.16
## age 1.07 1.05 1.08 8.50 0.00 1.17
## sys_bp 1.02 1.01 1.02 7.19 0.00 1.14
## glucose 1.01 1.00 1.01 3.57 0.00 1.01
## cigs_per_day 1.02 1.01 1.03 3.44 0.00 1.20
## --------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1337 219
## 1 8 16
##
## Accuracy : 0.8563
## 95% CI : (0.8381, 0.8733)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3
##
## Kappa : 0.0987
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06809
## Specificity : 0.99405
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.85925
## Prevalence : 0.14873
## Detection Rate : 0.01013
## Detection Prevalence : 0.01519
## Balanced Accuracy : 0.53107
##
## 'Positive' Class : 1
##
The number of cigarettes per day is a relevant factor, although being a current smoker is not. Since there is an interaction between being male, current smoker and having hypertension, it is worth exploring whether including these interactions would improve the performance of the model.
However, the results show that the coefficients associated with male:current_smoker and current_smoker:prevalent_hyp are not significant - especially after correction. Also, they are highly collinear with cigs_per_day - which becomes not significant - and do not improve the model performance.
For all these resasons, these interactions will not be included in the final model.
# Try to include `male:current_smoker` and `current_smoker:prevalent_hyp`
EVs <- c(
"male", "age", "sys_bp", "glucose", "male:current_smoker", "current_smoker:prevalent_hyp"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0959 -0.5871 -0.4262 -0.2940 2.7761
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.071834 0.575043 -14.037 < 2e-16 ***
## male1 0.411137 0.177391 2.318 0.020466 *
## age 0.064531 0.007790 8.284 < 2e-16 ***
## sys_bp 0.013450 0.003398 3.959 7.54e-05 ***
## glucose 0.007486 0.002101 3.563 0.000367 ***
## male0:current_smoker1 0.386439 0.213094 1.813 0.069760 .
## male1:current_smoker1 0.560270 0.206397 2.715 0.006637 **
## current_smoker0:prevalent_hyp1 0.472103 0.200758 2.352 0.018693 *
## current_smoker1:prevalent_hyp1 0.287916 0.201998 1.425 0.154058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1871.3 on 2462 degrees of freedom
## AIC: 1889.3
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(8) = 243.03, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1889.32, BIC = 1941.64
##
## Standard errors: MLE
## -------------------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p
## ------------------------------------ ----------- ------ ------- -------- ------
## (Intercept) 0.00 0.00 0.00 -14.04 0.00
## male1 1.51 1.07 2.14 2.32 0.02
## age 1.07 1.05 1.08 8.28 0.00
## sys_bp 1.01 1.01 1.02 3.96 0.00
## glucose 1.01 1.00 1.01 3.56 0.00
## male0:current_smoker1 1.47 0.97 2.23 1.81 0.07
## male1:current_smoker1 1.75 1.17 2.62 2.71 0.01
## current_smoker0:prevalent_hyp1 1.60 1.08 2.38 2.35 0.02
## current_smoker1:prevalent_hyp1 1.33 0.90 1.98 1.43 0.15
## -------------------------------------------------------------------------------
##
## -------------------------------------------
## VIF
## ------------------------------------ ------
## (Intercept)
## male1 2.24
## age 1.18
## sys_bp 1.96
## glucose 1.01
## male0:current_smoker1 4.18
## male1:current_smoker1 4.18
## current_smoker0:prevalent_hyp1 3.57
## current_smoker1:prevalent_hyp1 3.57
## -------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1332 221
## 1 13 14
##
## Accuracy : 0.8519
## 95% CI : (0.8334, 0.8691)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.4892
##
## Kappa : 0.0786
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.059574
## Specificity : 0.990335
## Pos Pred Value : 0.518519
## Neg Pred Value : 0.857695
## Prevalence : 0.148734
## Detection Rate : 0.008861
## Detection Prevalence : 0.017089
## Balanced Accuracy : 0.524955
##
## 'Positive' Class : 1
##
We also decided to try to model age as a sigmoid, under the assumption that the odds of a CHD risk in ten years were overall smaller for people < 40 and would increase at a steeper rate afterwards, however the perfomance of the model did not change.
sigmoid_function <- function(x, a, b, c, d) {
# Sigmoid function formula: a + (b - a) / (1 + exp(-c * (x - d)))
return(a + (b - a) / (1 + exp(-c * (x - d))))
}
# Define the parameters for the sigmoid function to control the curve
a_linear <- 0 # Start value (linear increase starts from 0)
b_linear <- 1 # End value of linear increase (1 represents 100%)
c_steepness <- 0.15 # Steepness factor for the sigmoid curve (adjust as needed)
d_transition <- 50 # Age at which the transition occurs (50 in your case)
train_sigmoid_age <- train %>%
mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))
test_sigmoid_age <- test %>%
mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))
plot(train_sigmoid_age$age, train_sigmoid_age$age_sigmoid)
# The proxy_age variable now has a smooth transition, increasing linearly until 50 and more steeply after that.
EVs <- c(
"male", "age_sigmoid", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(
train_data = train_sigmoid_age,
test_data = test_sigmoid_age,
EVs, thresh = 0.1
)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0739 -0.5929 -0.4279 -0.2942 2.7784
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.433353 0.401903 -16.007 < 2e-16 ***
## male1 0.449367 0.127351 3.529 0.000418 ***
## age_sigmoid 2.197337 0.258184 8.511 < 2e-16 ***
## sys_bp 0.018622 0.002598 7.167 7.64e-13 ***
## glucose 0.007502 0.002109 3.556 0.000376 ***
## cigs_per_day 0.017897 0.005184 3.452 0.000556 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1874.9 on 2465 degrees of freedom
## AIC: 1886.9
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 239.49, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1886.87, BIC = 1921.74
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -16.01 0.00
## male1 1.57 1.22 2.01 3.53 0.00 1.16
## age_sigmoid 9.00 5.43 14.93 8.51 0.00 1.17
## sys_bp 1.02 1.01 1.02 7.17 0.00 1.14
## glucose 1.01 1.00 1.01 3.56 0.00 1.01
## cigs_per_day 1.02 1.01 1.03 3.45 0.00 1.20
## --------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.10
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 584 39
## 1 761 196
##
## Accuracy : 0.4937
## 95% CI : (0.4687, 0.5186)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1183
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8340
## Specificity : 0.4342
## Pos Pred Value : 0.2048
## Neg Pred Value : 0.9374
## Prevalence : 0.1487
## Detection Rate : 0.1241
## Detection Prevalence : 0.6057
## Balanced Accuracy : 0.6341
##
## 'Positive' Class : 1
##
We decided to include the following variables in the final model:
At this point, we evaluate the performance of the model in terms of sensitivity (TP/TP+FN) using a default threshold of 0.5 for binary decision on the test set.
Our aim is to maximise the amount of people at risk which are predicted to be so by the model.
# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(EVs, thresh = 0.5)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1053 -0.5899 -0.4330 -0.2982 2.8162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.643579 0.504565 -17.131 < 2e-16 ***
## male1 0.449590 0.127385 3.529 0.000417 ***
## age 0.065845 0.007745 8.502 < 2e-16 ***
## sys_bp 0.018684 0.002597 7.194 6.29e-13 ***
## glucose 0.007547 0.002112 3.573 0.000353 ***
## cigs_per_day 0.017858 0.005184 3.445 0.000571 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1875.8 on 2465 degrees of freedom
## AIC: 1887.8
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -17.13 0.00
## male1 1.57 1.22 2.01 3.53 0.00 1.16
## age 1.07 1.05 1.08 8.50 0.00 1.17
## sys_bp 1.02 1.01 1.02 7.19 0.00 1.14
## glucose 1.01 1.00 1.01 3.57 0.00 1.01
## cigs_per_day 1.02 1.01 1.03 3.44 0.00 1.20
## --------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.50
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1337 219
## 1 8 16
##
## Accuracy : 0.8563
## 95% CI : (0.8381, 0.8733)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 0.3
##
## Kappa : 0.0987
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06809
## Specificity : 0.99405
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.85925
## Prevalence : 0.14873
## Detection Rate : 0.01013
## Detection Prevalence : 0.01519
## Balanced Accuracy : 0.53107
##
## 'Positive' Class : 1
##
Leaving the default threshold of 0.5 for binary prediction leads to identifying too few true positives. For this reason, we decided to maximise the Sensitivity (TP / (TP+FN)) even at the expense of assigning high risk of CHD to some people who are not.
This rationale is motivated by the fact that in this case the decision is not about doing or not doing an open-heart surgery, but rather to start to monitor people who might be at high risk. Plus, according the to the model, an accurate monitoring only requires inexpensive procedures: blood test and pressure measurements.
For this reason, it is of paramount important to include all the people who might be at risk even if they will turn out not to be so.
The threshold for risk detection is accordingly lowered to 0.1.
This leads to determine that almost 50% of the people who are not at risk will be deemed to be actually at risk. However, this choice will lead to a Sensitivity of about 80%.
NB: We also replace the cigs_per_day with units of tens of cigarettes
# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives
# EVs <- c(
# "male", "age", "sys_bp", "glucose", "cigs_per_day"
# )
#
# phat <- do_logistic_regression(EVs, thresh = 0.1)
train_cig_factor <- train %>%
mutate(tens_cigs = round(cigs_per_day/10,0))
test_cig_factor <- test %>%
mutate(tens_cigs = round(cigs_per_day/10,0))
EVs <- c(
"male", "age", "sys_bp", "glucose", "tens_cigs"
)
phat <- do_logistic_regression(
EVs,
train_data = train_cig_factor,
test_data = test_cig_factor,
thresh = 0.1
)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1007 -0.5886 -0.4341 -0.2981 2.8165
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.636592 0.503982 -17.137 < 2e-16 ***
## male1 0.448757 0.127401 3.522 0.000428 ***
## age 0.065813 0.007742 8.501 < 2e-16 ***
## sys_bp 0.018675 0.002597 7.191 6.45e-13 ***
## glucose 0.007510 0.002108 3.562 0.000368 ***
## tens_cigs 0.176640 0.051338 3.441 0.000580 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1875.8 on 2465 degrees of freedom
## AIC: 1887.8
##
## Number of Fisher Scoring iterations: 5
##
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.81, BIC = 1922.69
##
## Standard errors: MLE
## -------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ----------------- ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -17.14 0.00
## male1 1.57 1.22 2.01 3.52 0.00 1.16
## age 1.07 1.05 1.08 8.50 0.00 1.17
## sys_bp 1.02 1.01 1.02 7.19 0.00 1.14
## glucose 1.01 1.00 1.01 3.56 0.00 1.01
## tens_cigs 1.19 1.08 1.32 3.44 0.00 1.20
## -------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.10
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 588 38
## 1 757 197
##
## Accuracy : 0.4968
## 95% CI : (0.4719, 0.5218)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1218
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8383
## Specificity : 0.4372
## Pos Pred Value : 0.2065
## Neg Pred Value : 0.9393
## Prevalence : 0.1487
## Detection Rate : 0.1247
## Detection Prevalence : 0.6038
## Balanced Accuracy : 0.6377
##
## 'Positive' Class : 1
##
Code to manually calculate the Confusion Matrix and its derivatives. Values are not shown. This was just an excercise.
# # Binary prediction from the estimated probability values
# thresh <- 0.1
# predicted <- ifelse(phat > thresh, 1, 0)
#
# actual_predicted <- tibble(
# actual = test$ten_year_chd,
# predicted = ifelse(phat > thresh, 1, 0)
# )
#
# cm <- actual_predicted %>%
# summarise(
# TP = sum(actual == 1 & predicted == 1),
# TN = sum(actual == 0 & predicted == 0),
# FP = sum(actual == 0 & predicted == 1),
# FN = sum(actual == 1 & predicted == 0)
# )
#
# cm %>% print
#
# TP <- cm$TP
# TN <- cm$TN
# FP <- cm$FP
# FN <- cm$FN
#
#
# accuracy <- (TP + TN) / (TP+TN+FP+FN)
# print(paste0("Accuracy = ", round(accuracy,4)))
#
# sensitivity <- TP / (TP + FN)
# print(paste0("Sensitivity / Recall / TPR = ", round(sensitivity,4)))
#
# specificity <- TN / (TN + FP)
# print(paste0("Specificity / TNR = ", round(specificity,4)))
#
# PPV <- TP / (TP+FP)
# print(paste0("Positive Predictive Value = ", round(PPV,2)))
#
# NPV <- TN / (TN+FN)
# print(paste0("Negative Predictive Value = ", round(NPV,2)))
This particular stepwise regression leads to a much more complex model. It is important to note that the particular order of the predictors can change the final estimated “best” model (based on AIC).
However what is very important is that the cigs_per_day is identified as an important predictor, despite our EDA repeatedly disconfirmed so. The reasons are therefore to be explored, however for the time being we will try to include it into our manually built model (see above.)
library(MASS)
fullModel = glm(
ten_year_chd ~ male + prevalent_hyp + bp_meds + age + sys_bp + glucose + tot_chol,
family = "binomial", data = train
)
fullModel = glm(ten_year_chd ~ ., family = "binomial", data = train)
nullModel = glm(ten_year_chd ~ 1, family = "binomial", data = train)
step_fit <- stepAIC(
fullModel, direction = 'backward',
scope = list(upper = fullModel, lower = nullModel), trace = 0
)
step_fit %>% summary()
##
## Call:
## glm(formula = ten_year_chd ~ male + age + education + cigs_per_day +
## prevalent_stroke + prevalent_hyp + sys_bp + glucose, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0023 -0.5868 -0.4197 -0.2915 2.7940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.655170 0.587371 -13.033 < 2e-16 ***
## male1 0.387895 0.129655 2.992 0.002774 **
## age 0.060750 0.007983 7.610 2.74e-14 ***
## education2 -0.335970 0.150240 -2.236 0.025337 *
## education3 -0.352595 0.184679 -1.909 0.056232 .
## education4 -0.038528 0.195294 -0.197 0.843606
## cigs_per_day 0.019113 0.005227 3.657 0.000256 ***
## prevalent_stroke1 0.969730 0.601408 1.612 0.106868
## prevalent_hyp1 0.372800 0.161896 2.303 0.021295 *
## sys_bp 0.013282 0.003412 3.893 9.92e-05 ***
## glucose 0.007787 0.002106 3.698 0.000217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2114.4 on 2470 degrees of freedom
## Residual deviance: 1860.2 on 2460 degrees of freedom
## AIC: 1882.2
##
## Number of Fisher Scoring iterations: 5
step_fit %>% jtools::summ(confint = T, vifs = T, digits = 2) %>% print()
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(10) = 254.11, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1882.24, BIC = 1946.18
##
## Standard errors: MLE
## ----------------------------------------------------------------------
## Est. 2.5% 97.5% z val. p VIF
## ----------------------- ------- ------- ------- -------- ------ ------
## (Intercept) -7.66 -8.81 -6.50 -13.03 0.00
## male1 0.39 0.13 0.64 2.99 0.00 1.19
## age 0.06 0.05 0.08 7.61 0.00 1.23
## education2 -0.34 -0.63 -0.04 -2.24 0.03 1.09
## education3 -0.35 -0.71 0.01 -1.91 0.06 1.09
## education4 -0.04 -0.42 0.34 -0.20 0.84 1.09
## cigs_per_day 0.02 0.01 0.03 3.66 0.00 1.22
## prevalent_stroke1 0.97 -0.21 2.15 1.61 0.11 1.01
## prevalent_hyp1 0.37 0.06 0.69 2.30 0.02 1.85
## sys_bp 0.01 0.01 0.02 3.89 0.00 1.97
## glucose 0.01 0.00 0.01 3.70 0.00 1.02
## ----------------------------------------------------------------------
# Estimated probability values from the logistic model
phat_step_fit = predict(step_fit, newdata = test, type = "response")
# ROC curve
# pROC::roc(test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE)
par(pty = "s")
pROC::roc(
test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE,
xlab="% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4
)
##
## Call:
## roc.formula(formula = test$ten_year_chd ~ phat_step_fit, plot = TRUE, print.auc = TRUE, legacy.axes = TRUE, xlab = "% False Positives", ylab = "% True Positives", col = "#377eb8", lwd = 4)
##
## Data: phat_step_fit in 1345 controls (test$ten_year_chd 0) < 235 cases (test$ten_year_chd 1).
## Area under the curve: 0.7063
Evaluate predictivity of the stepwise model
# Binary prediction from the estimated probability values
thresh <- 0.1
predicted <- ifelse(phat_step_fit > thresh, 1, 0)
actual_predicted <- tibble(
actual = test$ten_year_chd,
predicted = ifelse(phat_step_fit > thresh, 1, 0)
)
caret::confusionMatrix(
as.factor(predicted), # predicted
as.factor(test$ten_year_chd), # actual
positive = "1" # the positive class in our case is "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 608 38
## 1 737 197
##
## Accuracy : 0.5095
## 95% CI : (0.4845, 0.5344)
## No Information Rate : 0.8513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1304
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8383
## Specificity : 0.4520
## Pos Pred Value : 0.2109
## Neg Pred Value : 0.9412
## Prevalence : 0.1487
## Detection Rate : 0.1247
## Detection Prevalence : 0.5911
## Balanced Accuracy : 0.6452
##
## 'Positive' Class : 1
##
Our sample is very unbalanced. We will select an equal number of people with and without risk in both train and test set to assess whether this will help model performance.
set.seed(9999)
min_train <- min(
nrow(train[train$ten_year_chd == 1,]),
nrow(train[train$ten_year_chd == 0,])
)
balanced_train <- train %>%
group_by(ten_year_chd) %>%
slice_sample(n = min_train)
min_test <- min(
nrow(test[test$ten_year_chd == 1,]),
nrow(test[test$ten_year_chd == 0,])
)
balanced_test <- test %>%
group_by(ten_year_chd) %>%
slice_sample(n = min_test)
# Full model with the selected variables
EVs <- c(
"male", "prevalent_hyp", "bp_meds", "education",
"age", "sys_bp", "bmi", "glucose", "tot_chol"
)
# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs <- c(
"male", "prevalent_hyp",
"age", "sys_bp", "glucose"
)
# Remove prevalent_hyp since it is only marginally significant
EVs <- c(
"male", "age", "sys_bp", "glucose"
)
# The stepwise regression - below - identified cigs_per_day as an important factor.
# I will therefore add it to the model.
EVs <- c(
"male", "age", "sys_bp", "glucose", "cigs_per_day"
)
phat <- do_logistic_regression(
EVs,
train_data = balanced_train, test_data = balanced_test,
thresh = 0.4
)
##
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1032 -1.0034 -0.1717 1.0161 2.0910
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.238967 0.720868 -10.042 < 2e-16 ***
## male1 0.319629 0.169510 1.886 0.05935 .
## age 0.074519 0.010682 6.976 3.04e-12 ***
## sys_bp 0.015708 0.003688 4.259 2.06e-05 ***
## glucose 0.010113 0.003923 2.578 0.00994 **
## cigs_per_day 0.030757 0.007706 3.991 6.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1048.04 on 755 degrees of freedom
## Residual deviance: 910.19 on 750 degrees of freedom
## AIC: 922.19
##
## Number of Fisher Scoring iterations: 4
##
## MODEL INFO:
## Observations: 756
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
## Family: binomial
## Link function: logit
##
## MODEL FIT:
## χ²(5) = 137.85, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.22
## Pseudo-R² (McFadden) = 0.13
## AIC = 922.19, BIC = 949.96
##
## Standard errors: MLE
## --------------------------------------------------------------------
## exp(Est.) 2.5% 97.5% z val. p VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept) 0.00 0.00 0.00 -10.04 0.00
## male1 1.38 0.99 1.92 1.89 0.06 1.12
## age 1.08 1.06 1.10 6.98 0.00 1.18
## sys_bp 1.02 1.01 1.02 4.26 0.00 1.11
## glucose 1.01 1.00 1.02 2.58 0.01 1.01
## cigs_per_day 1.03 1.02 1.05 3.99 0.00 1.20
## --------------------------------------------------------------------
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Using a threshold for prediction = 0.40
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 44
## 1 120 191
##
## Accuracy : 0.6511
## 95% CI : (0.6061, 0.6941)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 2.818e-11
##
## Kappa : 0.3021
##
## Mcnemar's Test P-Value : 4.727e-09
##
## Sensitivity : 0.8128
## Specificity : 0.4894
## Pos Pred Value : 0.6141
## Neg Pred Value : 0.7233
## Prevalence : 0.5000
## Detection Rate : 0.4064
## Detection Prevalence : 0.6617
## Balanced Accuracy : 0.6511
##
## 'Positive' Class : 1
##
Use the following model - estimated on the whole dataset - to predict the 10-year risk of CHD according to the given input.
\[ logOdds_{10yr-CHD-risk} = -8.671 + 0.51*male + 0.061*age + 0.0173*sys\_bp + 0.00758*glucose + 0.0204 * cigs\_per\_day \]
\[ p = \frac{e^{logOdds}}{1 + e^{logOdds}} \]
NB: only for illustrational purposes. Read and accept the following Disclaimer first